Abstract

In this work we study the parallel coordinate descent method (PCDM) proposed by Richtarik
and Takac [26] for minimizing a regularized convex function. We adopt elements from the work
of Lu and Xiao [39], and combine them with several new insights, to obtain sharper iteration
complexity results for PCDM than those presented in [26]. Moreover, we show that PCDM is 
monotonic in expectation, which was not confirmed in [26], and we also derive the first high 
probability iteration complexity result where the initial levelset is unbounded. 


1 Introduction 

Block coordinate descent methods are being thrust into the optimization spotlight because of a 
dramatic increase in the size of real world problems, and because of the “Big data” phenomenon. 
It is little wonder, when these seemingly simple methods, with low iteration costs and low memory 
requirements, can solve problems where the dimension is more than one billion, in a matter of hours 

m- 

There is an abundance of coordinate descent variants arising in the literature including: [4] 
ElEllIIlliailSllIllEalMlEZlESlEIlEaESlISlESlIMlETlISH]. The main differences between 
these methods are the way in which the block of coordinates to update is chosen, and how the
subproblem that determines the update applied to a block of variables is solved. The current,
state-of-the-art block coordinate descent method is the Parallel (block) Coordinate Descent Method 
(PCDM) of Richtarik and Takac [26]. This method selects the coordinates to update randomly 
and the update is determined by minimizing an overapproximation of the objective function at
the current point (see Section [3] for a detailed description). PCDM can be applied to a problem 
with a general convex composite objective, it is supported by strong iteration complexity results 
to guarantee the method’s convergence, and it has been tested numerically on a wide range of 
problems to demonstrate its practical capabilities. 

In this work we are interested in the following convex composite/regularized optimization problem

$$\min_{x \in \mathbb{R}^N} F(x) = f(x) + \Psi(x), \tag{1}$$

where we assume that $f(x)$ is a continuously differentiable convex function, and $\Psi(x)$ is assumed
to be a (possibly nonsmooth) block separable convex regularizer.

The Expected Separable Overapproximation (ESO) assumption introduced in [26] enabled the 
development of a unified theoretical framework that guarantees convergence of a serial [25] , parallel 
[26] and even distributed [211111123] version of PCDM. To benefit from the ESO abstraction, we 
derive all the results in this paper based on the assumption that $f$ admits an ESO with respect to
a uniform block sampling $\hat{S}$. This concept will be precisely defined in Section 3.2. For now it is
enough to say that updating a random set of $\tau$ coordinates (selected uniformly at random) is one
particular uniform sampling, and the ESO enables us to overapproximate the expected value of the
function at the next iteration by a separable function, which is easy to minimize in parallel.

1.1 Brief literature review 

Nesterov [18] provided some of the earliest iteration complexity results for a serial Randomized
Coordinate Descent Method (RCDM) for problems of the form (1), where $\Psi = 0$, or is the indicator
function for simple bound constraints. Later, this work was generalized to optimization problems
with a composite objective of the form (1), where the function $\Psi$ is any (possibly nonsmooth)
convex (block) separable function [25, 26].

One of the main advantages of randomized coordinate descent methods is that each iteration is 
extremely cheap, and can require as little as a few multiplications in some cases m- However, a 
large number of iterations may be required to obtain a sufficiently accurate solution, and for this 
reason, parallelization of coordinate descent methods is essential. 

The SHOTGUN algorithm presented in [1] represents a naive way of parallelizing RCDM,
applied to functions of the form (1) where $\Psi = \|\cdot\|_1$. Its authors also present theoretical results to show
that parallelization can lead to algorithm speedup. Unfortunately, their results show that only a 
small number of coordinates should be updated in parallel at each iteration, otherwise there is no 
guarantee of algorithm speedup. 

The first true complexity analysis of Parallel RCDM (PCDM) was provided in [26] after the
authors developed the concept of an Expected Separable Overapproximation (ESO) assumption, 
which was central to their convergence analysis. The ESO gives an upper bound on the expected 
value of the objective function after a parallel update of PCDM has been performed, and depends on 
both the objective function, and the particular ‘sampling’ (way that the coordinates are chosen) that 
was used. Moreover, several distributed PCDMs were considered in [2l ll4l[23] and their convergence 
was proved simply by deriving the ESO parameters for particular distributed samplings. 

In [3, 10] the accelerated PCDM was presented and its efficient distributed implementation was
considered in [2]. Recently, there has also been a focus on PCDMs that use an arbitrary sampling
of coordinates naisoiiiiiiMi. 

1.2 Summary of contributions 

In this section we summarize the main contributions of this paper (not in order of significance). 

1. No need to enforce “monotonicity”. PCDM in [26] was analyzed (for a general convex
composite function of the form (1)) under a monotonicity assumption: if, at any iteration
of PCDM, an update is computed that would lead to a higher objective value than the
objective value at the current point, then that update is rejected. Hence, PCDM as presented
in [26] included a step to force monotonicity of the function values at each iteration. In this
paper we confirm that the monotonicity test is redundant, and can be removed from the
algorithm.

2. First high-probability results for PCDM without levelset information. Currently, 
the high probability iteration complexity results for coordinate descent type methods require 
the levelset to be bounded. In this paper we derive the first high-probability result which 
does not rely on the size of the levelset. In particular, the analysis of PCDM in [26] assumes
that the levelset $\{x \in \mathbb{R}^N : F(x) \le F(x_0)\}$ is bounded for the initial point $x_0$, and under
this assumption, convergence is guaranteed. However, in this paper we show that PCDM
will converge, in expectation, to the optimal solution even if the levelset is unbounded (see
Section 5).

3. Sharper iteration complexity results. In this work we obtain sharper iteration complexity
results for PCDM than those presented in [26], and Table 1 summarizes our findings.
A thorough discussion of the results can be found in Section 6.2. We briefly describe the
variables used in the table (all will be properly defined in later sections). Variable $c$ is a constant,
$k$ is the iteration counter, $\alpha \in [0,1]$ is the expected proportion of coordinates updated at each
iteration, $\xi_0 = F(x_0) - F^*$, and $v$ is a (vector) parameter of the method. Also, $\mu_f$ and $\mu_\Psi$
are the (strong) convexity constants of $f$ and $\Psi$ respectively (both with respect to $\|\cdot\|_v$, for
some $v$) and $\epsilon$ and $\rho$ are the desired accuracy and confidence level respectively. (C=Convex,
SC=Strongly Convex).



Table 1: Comparison of the iteration complexity results for PCDM obtained in [26] and in this
paper. The analysis used in this paper provides a sharper iteration complexity result in both the
convex and strongly convex cases when $\epsilon$ and/or $\rho$ are small.


4. Improved convergence rates for PCDM. In this work we show that PCDM converges
at a faster rate than that given in [26], in both the convex and strongly convex cases. Table
2 provides a summary of our results and a thorough discussion can be found in Section 6.1.


Table 2: Comparison of the convergence rates for PCDM obtained in [26] and in this paper.
(C=Convex, SC=Strongly Convex). The analysis used in this paper provides a better rate of
convergence in both the convex and strongly convex cases when $\epsilon$ and/or $\rho$ are small.

1.3 Paper outline

The remainder of this paper is structured as follows. In Section 2 we introduce the notation and
assumptions that will be used throughout the paper. Section 3 describes PCDM of Richtarik and
Takac [26] in detail. We also present a new convergence rate result for PCDM, which is sharper
than that presented in [26]. The proof of the result is given in Section 4 along with several necessary
technical lemmas.

In Section 5 we present several iteration complexity results, which show that PCDM will
converge to an $\epsilon$-optimal solution with high probability. In Section 5.1 we provide the first iteration
complexity result for PCDM that does not require the assumption of a bounded levelset. The
result shows that PCDM requires $O(1/\rho)$ iterations, so we have devised a ‘multiple run strategy’
that achieves the classical $O(\log(1/\rho))$ result. Moreover, in Section 5.2 we present a high probability
iteration complexity result for PCDM, that assumes boundedness of the levelset, which is sharper
than the result given in [26].

In Section 6 we give a comparison of the results derived in this work with the results given in [26].
Then, we present several numerical experiments in Section 7 to highlight the practical capabilities
of PCDM under different ESO assumptions. The ESO assumptions are given in Appendix A, where
we also provide a new ESO for doubly uniform samplings (see Theorem 19).

2 Notation and assumptions 

In this section we introduce block structure and associated objects such as norms and projections. 
The parallel (block) coordinate descent method will operate on blocks instead of coordinates. 

2.1 Block structure 

The problem under consideration is assumed to have block structure and this is modelled by
decomposing the space $\mathbb{R}^N$ into $n$ subspaces as follows. Let $U \in \mathbb{R}^{N \times N}$ be a column permutation
of the $N \times N$ identity matrix and further let $U = [U_1, U_2, \dots, U_n]$ be a decomposition of $U$ into
$n$ submatrices, where $U_i$ is $N \times N_i$ and $\sum_{i=1}^n N_i = N$. Note that $U_i^T U_j = I_{N_i}$ when $i = j$ and
$U_i^T U_j = 0$ (where $0$ is the $N_i \times N_j$ matrix of all zeros) when $i \neq j$. Subsequently, any vector
$x \in \mathbb{R}^N$ can be written uniquely as

$$x = \sum_{i=1}^n U_i x^{(i)}, \tag{2}$$

where $x^{(i)} = U_i^T x \in \mathbb{R}^{N_i}$. For simplicity we will write $x = (x^{(1)}, x^{(2)}, \dots, x^{(n)})^T$.

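In what follows let $\langle \cdot, \cdot \rangle$ denote the standard Euclidean inner product. Then we have

$$\langle x, y \rangle = \Big\langle \sum_{j=1}^n U_j x^{(j)}, \sum_{i=1}^n U_i y^{(i)} \Big\rangle = \sum_{j=1}^n \sum_{i=1}^n \langle U_i^T U_j x^{(j)}, y^{(i)} \rangle = \sum_{i=1}^n \langle x^{(i)}, y^{(i)} \rangle. \tag{3}$$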

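Norms. Further we equip $\mathbb{R}^{N_i}$ with a pair of conjugate Euclidean norms:

$$\|h\|_{(i)} := \langle B_i h, h \rangle^{1/2}, \qquad \|h\|_{(i)}^* := \langle B_i^{-1} h, h \rangle^{1/2}, \tag{4}$$

where $B_i \in \mathbb{R}^{N_i \times N_i}$ is a positive definite matrix. For fixed positive scalars $v_1, v_2, \dots, v_n$, let
$v = (v_1, \dots, v_n)^T$ and define a pair of conjugate norms in $\mathbb{R}^N$ by

$$\|x\|_v^2 := \sum_{i=1}^n v_i \|x^{(i)}\|_{(i)}^2, \qquad (\|y\|_v^*)^2 := \max_{\|x\|_v \le 1} \langle y, x \rangle^2 = \sum_{i=1}^n v_i^{-1} \big(\|y^{(i)}\|_{(i)}^*\big)^2. \tag{5}$$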


Projection onto a set of blocks. Let $\emptyset \neq S \subseteq \{1, 2, \dots, n\}$. Then for $x \in \mathbb{R}^N$ we write

$$x_{[S]} := \sum_{i \in S} U_i x^{(i)}, \tag{6}$$

and we define $x_{[\emptyset]} = 0$. That is, given $x \in \mathbb{R}^N$, $x_{[S]}$ is the vector in $\mathbb{R}^N$ whose blocks $i \in S$ are
identical to those of $x$, but whose other blocks are zeroed out.
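The projection onto a set of blocks can be illustrated with a short sketch. It assumes, purely for illustration, that the permutation matrix is the identity, so that each block is a contiguous list of coordinate indices; the names `project_onto_blocks` and `blocks` are hypothetical and do not come from the paper.

```python
import numpy as np

def project_onto_blocks(x, blocks, S):
    """Return x_[S]: keep the blocks whose index is in S, zero out the rest.

    blocks[i] holds the coordinate indices of block i (assumption: U = I).
    """
    out = np.zeros_like(x)
    for i in S:
        idx = blocks[i]
        out[idx] = x[idx]
    return out

# Hypothetical structure: N = 5 coordinates split into n = 3 blocks.
blocks = [np.array([0, 1]), np.array([2]), np.array([3, 4])]
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

x_S = project_onto_blocks(x, blocks, S={0, 2})       # blocks 0 and 2 survive
x_empty = project_onto_blocks(x, blocks, S=set())    # x_[emptyset] = 0
```

Here block 1 (the single coordinate with value 3.0) is zeroed out, while the other two blocks are copied unchanged.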


2.2 Assumptions and strong convexity 

Throughout this paper we make the following assumption regarding the block separability of the
function $\Psi$.

Assumption 1 (Block separability). The nonsmooth function $\Psi : \mathbb{R}^N \to \mathbb{R} \cup \{+\infty\}$ is assumed
to be block separable, i.e., it can be decomposed as:

$$\Psi(x) = \sum_{i=1}^n \Psi_i(x^{(i)}), \tag{7}$$

where the functions $\Psi_i : \mathbb{R}^{N_i} \to \mathbb{R} \cup \{+\infty\}$ are proper, closed and convex.

In some of the results presented in this work we assume that $F$ is strongly convex and we denote
the (strong) convexity parameter of $F$, with respect to the norm $\|\cdot\|_v$ for some $v \in \mathbb{R}^n_{++}$, by $\mu_F > 0$.
A function $\phi : \mathbb{R}^N \to \mathbb{R} \cup \{+\infty\}$ is strongly convex with respect to the norm $\|\cdot\|_v$, with convexity
parameter $\mu_\phi \ge 0$, if for all $x, y \in \operatorname{dom} \phi$,

$$\phi(y) \ge \phi(x) + \langle \phi'(x), y - x \rangle + \frac{\mu_\phi}{2} \|y - x\|_v^2, \tag{8}$$

where $\phi'(x)$ is any subgradient of $\phi$ at $x$. The case with $\mu_\phi = 0$ reduces to convexity.

Strong convexity of $F$ may come from $f$ or $\Psi$ or both and we will write $\mu_f$ (resp. $\mu_\Psi$) for the
strong convexity parameter of $f$ (resp. $\Psi$). Following from (8),

$$\mu_F \ge \mu_f + \mu_\Psi. \tag{9}$$

From the first order optimality conditions for (1) we obtain $\langle F'(x^*), x - x^* \rangle \ge 0$ for all
$x \in \operatorname{dom} F$. Combining this with (8) used with $y = x$ and $x = x^*$, yields the standard inequality

$$F(x) - F^* \ge \frac{\mu_F}{2} \|x - x^*\|_v^2, \qquad x \in \operatorname{dom} F. \tag{10}$$

3 Parallel coordinate descent method


In this section we describe the Parallel Coordinate Descent Method (Algorithm 1) of Richtarik and
Takac [26]. We now present the algorithm, and a detailed discussion will follow.


Algorithm 1 PCDM: Parallel Coordinate Descent Method [26]

1: choose initial point $x_0 \in \mathbb{R}^N$
2: for $k = 0, 1, 2, \dots$ do
3:   randomly choose set of blocks $S_k \subseteq \{1, \dots, n\}$
4:   for $i \in S_k$ (in parallel) do
5:     compute $h(x_k)^{(i)} = \arg\min_{t \in \mathbb{R}^{N_i}} \big\{ \langle (\nabla f(x_k))^{(i)}, t \rangle + \frac{v_i}{2} \|t\|_{(i)}^2 + \Psi_i(x_k^{(i)} + t) \big\}$
6:   end for
7:   apply the update: $x_{k+1} \leftarrow x_k + \sum_{i \in S_k} U_i h(x_k)^{(i)}$
8: end for


The algorithm can be described as follows. At iteration $k$ of Algorithm 1, a set of blocks $S_k$
is chosen, corresponding to the (blocks of) coordinates that are to be updated. The set of blocks
is selected via a sampling, which is described in detail in Section 3.1. Then, in Steps 4-6, the
updates $h(x_k)^{(i)}$, for all $i \in S_k$, are computed in parallel, via a small/low dimensional minimization
subproblem. (In Section 3.2 we describe the origin of this subproblem via an ESO.) Finally, in
Step 7, the updates $h(x_k)^{(i)}$ are applied to the current point $x_k$, to give the new point $x_{k+1}$. Notice
that Algorithm 1 does not require knowledge of objective function values.
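To make the steps concrete, the following is a minimal serial sketch of the iteration, not the authors' implementation. It assumes a specific instance that the text does not fix: $f(x) = \frac{1}{2}\|Ax - b\|^2$, $\Psi = \lambda\|\cdot\|_1$, scalar blocks with $B_i = 1$, and a sampling of $\tau$ coordinates chosen uniformly at random. The ESO parameter is taken to be $v_i = \beta \|A_{:,i}\|^2$ with $\beta = 1 + (\omega - 1)(\tau - 1)/\max(1, n-1)$, where $\omega$ is the largest number of nonzeros in a row of $A$; this is the choice reported in [26] for such samplings and is treated here as an assumption. The function names are hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    """Minimizer of 0.5*(u - z)^2 + t*|u| over u, applied elementwise."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def pcdm_lasso(A, b, lam, tau, iters, seed=0):
    """Serial simulation of the PCDM iteration for 0.5*||Ax-b||^2 + lam*||x||_1."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    omega = int(np.max(np.count_nonzero(A, axis=1)))
    beta = 1.0 + (omega - 1) * (tau - 1) / max(1, n - 1)
    v = beta * np.sum(A * A, axis=0)      # assumed ESO parameter; needs nonzero columns
    x = np.zeros(n)
    residual = A @ x - b
    for _ in range(iters):
        S = rng.choice(n, size=tau, replace=False)      # Step 3: sample tau blocks
        grad_S = A[:, S].T @ residual                   # partial gradient at x_k
        # Steps 4-6: closed-form solution of each one-dimensional subproblem
        h_S = soft_threshold(x[S] - grad_S / v[S], lam / v[S]) - x[S]
        x[S] += h_S                                     # Step 7: apply the update
        residual += A[:, S] @ h_S
    return x

def objective(A, b, lam, x):
    return 0.5 * np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x))

rng = np.random.default_rng(1)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
x_hat = pcdm_lasso(A, b, lam=0.1, tau=2, iters=5000)
```

All updates within one iteration are computed from the same point $x_k$, which is what would allow them to run in parallel; the loop body never evaluates the objective, and `objective` is included only so that the result can be inspected afterwards.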

We now describe the key steps of Algorithm 1 (Steps 3 and 4-6) in more detail.

3.1 Step 3: Sampling

At the $k$th iteration of Algorithm 1, a set of indices $S_k \subseteq \{1, \dots, n\}$ (corresponding to the blocks of
$x_k$ to be updated) is selected. Here we briefly explain several schemes for choosing the set of indices
$S_k$; a thorough description can be found in [26]. Formally, $S_k$ is a realisation of a random set-valued
mapping $\hat{S}$ with values in $2^{\{1, \dots, n\}}$. Richtarik and Takac [26] have coined the term sampling in
reference to $\hat{S}$.

In what follows, we will assume that all samplings are proper. That is, we assume that $p_i > 0$
for all blocks $i$, where $p_i$ is the probability that the $i$th block of $x$ is updated.

We state several sampling schemes now. 

1. Uniform: A sampling S is uniform if all blocks have the same probability of being updated. 

2. Doubly uniform: A doubly uniform sampling is one that generates all sets of equal
cardinality with equal probability. That is, $\mathbf{P}(\hat{S} = S') = \mathbf{P}(\hat{S} = S'')$ whenever $|S'| = |S''|$.

3. Nonoverlapping uniform: A nonoverlapping uniform sampling is one that is uniform and
assigns positive probabilities only to sets forming a partition of $\{1, \dots, n\}$.

In fact, doubly uniform and nonoverlapping uniform samplings are special cases of uniform
samplings, so in this work all results are proved for uniform samplings. Other samplings, which are
also special cases of uniform samplings, are presented in [26], but we omit details of all, except a
$\tau$-nice sampling, for brevity. We say that a sampling $\hat{S}$ is $\tau$-nice, if for any $S \subseteq \{1, 2, \dots, n\}$ we
have

$$\mathbf{P}(\hat{S} = S) = \begin{cases} 0, & \text{if } |S| \neq \tau, \\ \dfrac{\tau!(n-\tau)!}{n!}, & \text{otherwise.} \end{cases} \tag{11}$$
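A $\tau$-nice sampling can be sketched in a few lines. The sketch below draws $\tau$ distinct block indices uniformly at random and evaluates the probability in (11); enumerating all subsets for a small $n$ then checks that the probabilities sum to one and that each block is selected with probability $\tau/n$, so the sampling is uniform. The function names are hypothetical.

```python
import itertools
import math
import numpy as np

def tau_nice_sample(n, tau, rng):
    """Draw one realisation of a tau-nice sampling over blocks 0..n-1."""
    return frozenset(rng.choice(n, size=tau, replace=False).tolist())

def tau_nice_probability(S, n, tau):
    """Probability that the sampling equals the set S, following (11)."""
    if len(S) != tau:
        return 0.0
    return math.factorial(tau) * math.factorial(n - tau) / math.factorial(n)

n, tau = 5, 2
all_subsets = [frozenset(c) for r in range(n + 1)
               for c in itertools.combinations(range(n), r)]
total = sum(tau_nice_probability(S, n, tau) for S in all_subsets)
p_block0 = sum(tau_nice_probability(S, n, tau) for S in all_subsets if 0 in S)
sample = tau_nice_sample(n, tau, np.random.default_rng(0))
```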


3.2 Step 5: Computing the step-length

The block update $h(x_k)^{(i)}$ is chosen in such a way that an upper bound on the expected function
value at the next iterate is minimized, with respect to the particular sampling $\hat{S}$ that is used.
The expected upper bound should be (block) separable to ensure efficient
parallelizability. Before we focus on how to construct the expected upper bound on $F$ we will state
a definition of ESO.


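Definition 2 (Expected Separable Overapproximation; Definition 5 in [5B]). Let $v \in \mathbb{R}^n_{++}$ and
$\hat{S}$ be a proper uniform sampling. We say that $f : \mathbb{R}^N \to \mathbb{R}$ admits an ESO with respect to the
sampling $\hat{S}$ with parameter $v$, if, for all $x, h \in \mathbb{R}^N$ the following inequality holds:

$$\mathbf{E}[f(x + h_{[\hat{S}]})] \le f(x) + \frac{\mathbf{E}[|\hat{S}|]}{n} \Big( \langle \nabla f(x), h \rangle + \frac{1}{2} \|h\|_v^2 \Big). \tag{12}$$

We say that the ESO is monotonic if, for all $S$ such that $\mathbf{P}(\hat{S} = S) > 0$, the following holds:

$$f(x + h_{[S]}) \le f(x).$$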

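In Appendix A, a review of different smoothness assumptions on $f$ and corresponding ESO
parameters $v$ for a doubly uniform sampling is given. In all that follows, we assume that $f$ admits
an ESO, and that $v$ is the ESO parameter and $\hat{S}$ is a proper uniform sampling. Then

$$\begin{aligned}
\mathbf{E}[F(x + h_{[\hat{S}]})] &= \mathbf{E}[f(x + h_{[\hat{S}]})] + \mathbf{E}[\Psi(x + h_{[\hat{S}]})] \\
&\le f(x) + \frac{\mathbf{E}[|\hat{S}|]}{n} \Big( \langle \nabla f(x), h \rangle + \frac{1}{2} \|h\|_v^2 \Big) + \Big( 1 - \frac{\mathbf{E}[|\hat{S}|]}{n} \Big) \Psi(x) + \frac{\mathbf{E}[|\hat{S}|]}{n} \Psi(x + h),
\end{aligned} \tag{13}$$

where we have used the fact that $\Psi$ is block separable and that $\hat{S}$ is a proper uniform sampling
(see [26, Theorem 4]).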

Now, it is easy to see that minimizing the right hand side of (13) in $h$ is the same as minimizing
the function $H_v$ in $h$, where $H_v$ is defined to be

$$H_v(x, h) := f(x) + \langle \nabla f(x), h \rangle + \frac{1}{2} \|h\|_v^2 + \Psi(x + h). \tag{14}$$

In view of (2), (5), and (7), we can write

$$H_v(x, h) = f(x) + \sum_{i=1}^n \Big\{ \langle (\nabla f(x))^{(i)}, h^{(i)} \rangle + \frac{v_i}{2} \|h^{(i)}\|_{(i)}^2 + \Psi_i(x^{(i)} + h^{(i)}) \Big\}.$$
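Because the function above is a sum of terms that each involve a single block, it can be minimized one block at a time, and for simple regularizers each block problem has a closed form. As an illustration under assumptions not made in the text (scalar blocks, $B_i = 1$, and $\Psi_i$ the indicator function of an interval $[lo, hi]$), the minimizer is a clipped gradient step. The sketch below compares that closed form with a brute-force search over a grid; the helper names are hypothetical.

```python
import numpy as np

def block_update_box(x_i, g_i, v_i, lo, hi):
    """Minimize g_i*t + 0.5*v_i*t^2 subject to lo <= x_i + t <= hi (scalar block)."""
    return float(np.clip(x_i - g_i / v_i, lo, hi)) - x_i

def block_update_grid(x_i, g_i, v_i, lo, hi, points=10001):
    """Brute-force reference: search over a grid of feasible values u = x_i + t."""
    u = np.linspace(lo, hi, points)
    t = u - x_i
    values = g_i * t + 0.5 * v_i * t ** 2
    return float(t[np.argmin(values)])

# Interior solution: the unconstrained step x_i - g_i/v_i = 0.25 lies inside [0, 1].
t_interior = block_update_box(x_i=0.5, g_i=1.0, v_i=4.0, lo=0.0, hi=1.0)
# Boundary solution: the unconstrained step would leave [0, 1], so it is clipped.
t_clipped = block_update_box(x_i=0.5, g_i=-10.0, v_i=4.0, lo=0.0, hi=1.0)
```

Only the partial gradient $g_i$, the ESO parameter $v_i$ and the current block value enter the update, which is why each block can be handled by a separate worker.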

Further, we define

$$h(x) := \arg\min_{h \in \mathbb{R}^N} H_v(x, h), \tag{15}$$

which is the update used in Algorithm 1. Notice that the algorithm never evaluates function values.

3.3 Complexity of PCDM

We are now ready to present one of our main results, which is a generalization of Theorem 1 in
[39]. The result shows that PCDM converges in expectation and provides a sharper convergence
rate than that given in [26]. The proof is provided in Section 4. Let us mention that a similar
result was given independently^1 in [15], but that result only holds for the particular ESO described
in Theorem 21. However, even for that ESO, our result (Theorem 3) is still much better because it
depends on $\|x_0 - x^*\|_v^2$ and not on the size of the initial levelset (which could even be unbounded).
We state our result now.

Theorem 3. Let $F^*$ be the optimal value of problem (1) and let $\{x_k\}_{k \ge 0}$ be the sequence of iterates
generated by PCDM using a uniform sampling $\hat{S}$. Let $\alpha = \mathbf{E}[|\hat{S}|]/n$ and suppose that $f$ admits an ESO
with respect to the sampling $\hat{S}$ with parameter $v$. Then for any $k \ge 0$,

(i) the iterate $x_k$ satisfies

$$\mathbf{E}[F(x_k) - F^*] \le \frac{1}{1 + \alpha k} \Big( \frac{1}{2} \|x_0 - x^*\|_v^2 + F(x_0) - F^* \Big), \tag{16}$$

(ii) if $\mu_f + \mu_\Psi > 0$, then the iterate $x_k$ satisfies

$$\mathbf{E}[F(x_k) - F^*] \le \Big( 1 - \alpha \frac{2(\mu_f + \mu_\Psi)}{1 + \mu_f + 2\mu_\Psi} \Big)^k \Big( \frac{1 + \mu_\Psi}{2} \|x_0 - x^*\|_v^2 + F(x_0) - F^* \Big). \tag{17}$$

Remark 4. Notice that Theorem 3 is a general result, in the sense that any ESO can be used for
PCDM and the result holds.
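The two rates can be turned into iteration counts for a target accuracy in expectation. The sketch below reads the bounds the way the proof in Section 4.3 derives them, namely $\frac{1}{1+\alpha k}\big(\frac{1}{2}\|x_0 - x^*\|_v^2 + F(x_0) - F^*\big)$ in the convex case and $(1 - \alpha\gamma)^k \big(\frac{1+\mu_\Psi}{2}\|x_0 - x^*\|_v^2 + F(x_0) - F^*\big)$ with $\gamma = 2(\mu_f+\mu_\Psi)/(1+\mu_f+2\mu_\Psi)$ in the strongly convex case. That reading, the function names and the sample numbers are assumptions made for illustration.

```python
import math

def iters_convex(eps, alpha, dist_sq, gap0):
    """Smallest k with (0.5*dist_sq + gap0) / (1 + alpha*k) <= eps."""
    total = 0.5 * dist_sq + gap0
    return max(0, math.ceil((total / eps - 1.0) / alpha))

def iters_strongly_convex(eps, alpha, mu_f, mu_psi, dist_sq, gap0):
    """Smallest k with (1 - alpha*gamma)^k * (0.5*(1+mu_psi)*dist_sq + gap0) <= eps.

    Assumes 0 < alpha*gamma < 1 so that the logarithm below is finite.
    """
    gamma = 2.0 * (mu_f + mu_psi) / (1.0 + mu_f + 2.0 * mu_psi)
    total = 0.5 * (1.0 + mu_psi) * dist_sq + gap0
    if total <= eps:
        return 0
    return math.ceil(math.log(total / eps) / -math.log1p(-alpha * gamma))

# Hypothetical numbers: half of the blocks updated per iteration (alpha = 0.5),
# squared initial distance 2.0 and initial gap F(x_0) - F^* = 1.0.
k_convex = iters_convex(eps=0.5, alpha=0.5, dist_sq=2.0, gap0=1.0)
k_strong = iters_strongly_convex(eps=0.4, alpha=0.5, mu_f=1.0, mu_psi=0.0,
                                 dist_sq=2.0, gap0=1.0)
```

The convex count grows like $1/\epsilon$, while the strongly convex count grows like $\log(1/\epsilon)$, which is the usual gap between sublinear and linear rates.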

4 Proof of the main result 

In this section we provide a proof of our main convergence rate result, Theorem 3. However, first
we will present several preliminary results, including the idea of a composite gradient mapping,
and other technical lemmas.

4.1 Block composite gradient mapping 


We now define the concept of a block composite gradient mapping dZl [39]. By the first-order 
optimality conditions for problem (15), there exists a subgradient $s^{(i)} \in \partial \Psi_i(x^{(i)} + (h(x))^{(i)})$ (where
$\partial \Psi_i(\cdot)$ denotes the subdifferential of $\Psi_i(\cdot)$) such that

$$(\nabla f(x))^{(i)} + v_i B_i (h(x))^{(i)} + s^{(i)} = 0. \tag{18}$$

We define the block composite gradient mappings as

$$(g(x))^{(i)} := -v_i B_i (h(x))^{(i)}, \qquad i = 1, \dots, n. \tag{19}$$

From (18) and (19) we obtain

$$-(\nabla f(x))^{(i)} + (g(x))^{(i)} \in \partial \Psi_i(x^{(i)} + (h(x))^{(i)}), \qquad i = 1, \dots, n. \tag{20}$$

^1 A preliminary version of this paper was ready in August 2013.





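If we let $g(x) := \sum_{i=1}^n U_i (g(x))^{(i)}$ (compare (2) and (19)), then since $\Psi$ is separable, (20) can be
written as

$$-\nabla f(x) + g(x) \in \partial \Psi(x + h(x)). \tag{21}$$

Moreover

$$(\|g(x)\|_v^*)^2 = \sum_{i=1}^n v_i^{-1} \big( \|(g(x))^{(i)}\|_{(i)}^* \big)^2 = \sum_{i=1}^n v_i \|(h(x))^{(i)}\|_{(i)}^2 = \|h(x)\|_v^2 \tag{22}$$

and

$$\langle g(x), h(x) \rangle = -\|h(x)\|_v^2 = -(\|g(x)\|_v^*)^2. \tag{23}$$

Finally, note that using (4), (5), (19) and (23), we get

$$\|x + h(x) - y\|_v^2 = \|x - y\|_v^2 + 2 \langle g(x), y - x \rangle + (\|g(x)\|_v^*)^2. \tag{24}$$

4.2 Main technical lemmas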


The following result concerns the expected value of a block-separable function when a random 
subset of coordinates is updated. 

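Lemma 5 (Theorem 4 in [26]). Suppose that $\Psi(x) = \sum_{i=1}^n \Psi_i(x^{(i)})$. For any $x \in \mathbb{R}^N$, if we
choose a uniform sampling $\hat{S}$, then letting $\alpha = \mathbf{E}[|\hat{S}|]/n$, we have

$$\mathbf{E}[\Psi(x + (h(x))_{[\hat{S}]})] = \alpha \Psi(x + h(x)) + (1 - \alpha) \Psi(x). \tag{25}$$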

The following technical lemma plays a central role in our analysis. The result can be viewed as 
a generalization of Lemma 3 in [39] . which considers the serial case (a = 1), to the parallel setting. 

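Lemma 6. Let $x \in \operatorname{dom} F$ and $x_+ = x + (h(x))_{[\hat{S}]}$, where $\hat{S}$ is any uniform sampling. Then for
any $y \in \operatorname{dom} F$,

$$\mathbf{E}\Big[ F(x_+) + \frac{1 + \mu_\Psi}{2} \|x_+ - y\|_v^2 \Big] \le F(x) + \frac{1 + \mu_\Psi}{2} \|x - y\|_v^2 - \alpha \Big( F(x) - F(y) + \frac{\mu_f + \mu_\Psi}{2} \|x - y\|_v^2 \Big). \tag{26}$$

Moreover,

(i)
$$\mathbf{E}[F(x_+)] \le F(x) - \frac{\alpha}{2} (\mu_\Psi + 1) \|h(x)\|_v^2 = F(x) - \frac{\alpha}{2} (\mu_\Psi + 1) (\|g(x)\|_v^*)^2, \tag{27}$$

(ii)
$$\mathbf{E}\Big[ F(x_+) + \frac{1}{2} \|x_+ - y\|_v^2 \Big] \le F(x) + \frac{1}{2} \|x - y\|_v^2 - \alpha (F(x) - F(y)). \tag{28}$$

Proof. We first note that

$$\mathbf{E}[\|x_+ - y\|_v^2] = \alpha \|x + h(x) - y\|_v^2 + (1 - \alpha) \|x - y\|_v^2. \tag{29}$$

This is a special case of the identity $\mathbf{E}[\psi(u + h_{[\hat{S}]})] = \alpha \psi(u + h) + (1 - \alpha) \psi(u)$ (see Lemma 5), which
holds for block separable functions $\psi$, with $\psi(u) = \|u\|_v^2$, $u = x - y$ and $h = h(x)$.

Further, for any $h$ for which $x + h \in \operatorname{dom} \Psi$, we have

$$\mathbf{E}[F(x + h_{[\hat{S}]})] \le (1 - \alpha) F(x) + \alpha H_v(x, h). \tag{30}$$

This was established in [26, Section 5]. The claim now follows by combining (30), used with
$h = h(x)$, and the following estimate of $H_v(x, h(x))$:

$$\begin{aligned}
H_v(x, h(x)) &= f(x) + \langle \nabla f(x), h(x) \rangle + \frac{1}{2} \|h(x)\|_v^2 + \Psi(x + h(x)) \\
&\le f(y) + \langle \nabla f(x), x - y \rangle - \frac{\mu_f}{2} \|y - x\|_v^2 + \langle \nabla f(x), h(x) \rangle + \frac{1}{2} \|h(x)\|_v^2 \\
&\qquad + \Psi(y) + \langle -\nabla f(x) + g(x), x + h(x) - y \rangle - \frac{\mu_\Psi}{2} \|x + h(x) - y\|_v^2 \\
&= F(y) + \langle g(x), x - y \rangle + \langle g(x), h(x) \rangle - \frac{\mu_f}{2} \|y - x\|_v^2 \\
&\qquad - \frac{\mu_\Psi}{2} \|x + h(x) - y\|_v^2 + \frac{1}{2} \|h(x)\|_v^2 \\
&= F(y) + \langle g(x), x - y \rangle - \frac{\mu_f}{2} \|y - x\|_v^2 - \frac{\mu_\Psi}{2} \|x + h(x) - y\|_v^2 - \frac{1}{2} (\|g(x)\|_v^*)^2 \\
&= F(y) + \frac{1 - \mu_f}{2} \|y - x\|_v^2 - \frac{1 + \mu_\Psi}{2} \|x + h(x) - y\|_v^2 \\
&= F(y) + \frac{1 - \mu_f}{2} \|y - x\|_v^2 - \frac{1 + \mu_\Psi}{2\alpha} \big( \mathbf{E}[\|x_+ - y\|_v^2] - (1 - \alpha) \|x - y\|_v^2 \big).
\end{aligned}$$

Here the first line is the definition (14), the inequality uses (8) for $f$ and for $\Psi$ together with (21), and
the remaining equalities use (22) and (23), then (24), then (29). Part (i) follows by letting $x = y$ and
using (22) and (23). Part (ii) follows as a special case by choosing $\mu_f = \mu_\Psi = 0$. □

Property (i) means that function values $F(x_k)$ of PCDM are monotonically decreasing in
expectation when conditioned on the previous iteration.


4.3 Proof of Theorem 3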


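Proof. Let $x^*$ be an arbitrary optimal solution of (1). Let $r_k^2 = \|x_k - x^*\|_v^2$, $g_k = g(x_k)$, $h_k = h(x_k)$
and $F_k = F(x_k)$. Notice that $x_{k+1} = x_k + (h_k)_{[S_k]}$. By subtracting $F^*$ from both sides of (28), we
get

$$\mathbf{E}\Big[ \frac{1}{2} r_{k+1}^2 + F_{k+1} - F^* \,\Big|\, x_k \Big] \le \Big( \frac{1}{2} r_k^2 + F_k - F^* \Big) - \alpha (F_k - F^*),$$

and taking expectations with respect to the whole history of realizations of $S_l$, $l \le k$, gives us

$$\mathbf{E}\Big[ \frac{1}{2} r_{k+1}^2 + F_{k+1} - F^* \Big] \le \mathbf{E}\Big[ \frac{1}{2} r_k^2 + F_k - F^* \Big] - \alpha \mathbf{E}[F_k - F^*].$$

Applying this inequality recursively and using the fact that $\mathbf{E}[F_j]$ is monotonically decreasing for
$j = 0, 1, \dots, k+1$ we obtain

$$\begin{aligned}
\mathbf{E}[F_{k+1} - F^*] &\le \mathbf{E}\Big[ \frac{1}{2} r_{k+1}^2 + F_{k+1} - F^* \Big] \\
&\le \frac{1}{2} r_0^2 + F_0 - F^* - \alpha \sum_{j=0}^{k} (\mathbf{E}[F_j] - F^*) \\
&\le \frac{1}{2} r_0^2 + F_0 - F^* - \alpha (k+1) (\mathbf{E}[F_{k+1}] - F^*),
\end{aligned}$$

which leads to (16).

We now prove (17) under the strong convexity assumption $\mu_f + \mu_\Psi > 0$. From (26)
we get

$$\mathbf{E}\Big[ \frac{1 + \mu_\Psi}{2} r_{k+1}^2 + F_{k+1} - F^* \,\Big|\, x_k \Big] \le \Big( \frac{1 + \mu_\Psi}{2} r_k^2 + F_k - F^* \Big) - \alpha \Big( \frac{\mu_f + \mu_\Psi}{2} r_k^2 + F_k - F^* \Big). \tag{31}$$

Notice that for any $0 \le \gamma \le 1$ we have, using (9) and (10),

$$\begin{aligned}
\frac{\mu_f + \mu_\Psi}{2} r_k^2 + F_k - F^* &= \gamma \Big( \frac{\mu_f + \mu_\Psi}{2} r_k^2 + F_k - F^* \Big) + (1 - \gamma) \Big( \frac{\mu_f + \mu_\Psi}{2} r_k^2 + F_k - F^* \Big) \\
&\ge \gamma \Big( \frac{\mu_f + \mu_\Psi}{2} r_k^2 + F_k - F^* \Big) + (1 - \gamma) (\mu_f + \mu_\Psi) r_k^2.
\end{aligned}$$

Choosing

$$\gamma^* := \frac{2(\mu_f + \mu_\Psi)}{1 + \mu_f + 2\mu_\Psi} \in [0, 1] \tag{32}$$

we obtain

$$\frac{\mu_f + \mu_\Psi}{2} r_k^2 + F_k - F^* \ge \gamma^* \Big( \frac{1 + \mu_\Psi}{2} r_k^2 + F_k - F^* \Big).$$

Combining the inequality above with (31) gives

$$\mathbf{E}\Big[ \frac{1 + \mu_\Psi}{2} r_{k+1}^2 + F_{k+1} - F^* \,\Big|\, x_k \Big] \le (1 - \gamma^* \alpha) \Big( \frac{1 + \mu_\Psi}{2} r_k^2 + F_k - F^* \Big). \tag{33}$$

It now only remains to take expectation in $x_k$ on both sides of (33), and (17) follows. □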


5 High Probability Convergence Result 

Theorem 3 showed that Algorithm 1 converges to the optimal solution in expectation. In this
section we derive iteration complexity bounds for PCDM for obtaining an $\epsilon$-optimal solution with
high probability. Let us mention that all existing [Ml EH EH [39] high-probability results for
serial or parallel CDM require a bounded levelset, i.e. they assume that

$$\mathcal{L}(x_0) = \{x \in \mathbb{R}^N : F(x) \le F(x_0)\} \tag{34}$$

is bounded. In Section 5.1 we present the first high probability result in the case when the levelset
can be unbounded (Corollary 9 and Corollary 11). Then in Section 5.2 we derive a sharper
high-probability result for PCDM of [26] if a bounded levelset is assumed (i.e. $\mathcal{L}(x_0)$ is bounded).

5.1 Case 1: Possibly unbounded levelset 

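We begin by presenting Lemma 7, which will allow us to state the first high-probability result
(Corollary 9) for PCDM applied to a convex function that does not require the assumption of a
bounded levelset.

Lemma 7. Let $x_0$ be fixed and $\{x_k\}_{k \ge 0}$ be a sequence of random vectors in $\mathbb{R}^N$ such that the
conditional distribution of $x_{k+1}$ on $x_k$ is the same as the conditional distribution of $x_{k+1}$ on the whole
history $\{x_i\}_{i=0}^k$ (hence we have a Markov sequence). Let us define $r_k = \phi_r(x_k)$ and $\xi_k = \phi_\xi(x_k)$, where
$\phi_r, \phi_\xi : \mathbb{R}^N \to \mathbb{R}$ are non-negative functions. Further, let us assume that the following two inequalities
hold for any $k$:

$$\mathbf{E}\Big[ \frac{1}{2} r_{k+1} + \xi_{k+1} \,\Big|\, x_k \Big] \le \frac{1}{2} r_k + (1 - \zeta) \xi_k, \tag{35}$$

$$\mathbf{E}[\xi_{k+1} \mid x_k] \le \xi_k, \tag{36}$$

with some known $\zeta \in (0, 1)$. Then if

$$K \ge \frac{1}{\zeta} \Big( \frac{\frac{1}{2} r_0 + \xi_0}{\rho \epsilon} - 1 \Big), \tag{37}$$

then

$$\mathbf{P}(\xi_K < \epsilon) \ge 1 - \rho.$$

Proof. Using (35) we have

$$\mathbf{E}[\xi_k] \le \mathbf{E}\Big[ \frac{1}{2} r_k + \xi_k \Big] \le \frac{1}{2} r_0 + \xi_0 - \zeta \sum_{j=0}^{k-1} \mathbf{E}[\xi_j] \le \frac{1}{2} r_0 + \xi_0 - \zeta k \, \mathbf{E}[\xi_k],$$

where the last inequality uses (36). Hence

$$\mathbf{E}[\xi_k] \le \frac{\frac{1}{2} r_0 + \xi_0}{1 + k \zeta}. \tag{38}$$

Now, from the Markov inequality, (38) and (37), we have

$$\mathbf{P}(\xi_K \ge \epsilon) \le \frac{\mathbf{E}[\xi_K]}{\epsilon} \le \frac{1}{\epsilon} \cdot \frac{\frac{1}{2} r_0 + \xi_0}{1 + K \zeta} \le \rho.$$

□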


Naturally, the result $O(\frac{1}{\rho \epsilon})$ is very pessimistic and hence one may be concerned about tightness
of the lemma. The following example, indeed, shows that Lemma 7 is tight, i.e. the bound on $K$
cannot be improved much. (We construct an example that, under the assumptions (35) and (36)
(i.e., using the analysis of [39]), requires $\Omega(\frac{1}{\rho \epsilon})$ iterations.)

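Example 8 (Tightness of Lemma 7). Let us fix some small value of $\rho \in (0, 1)$ and assume that
$(r_1, \xi_1)$ have the following distribution:

$$(r_1, \xi_1) = \begin{cases} (0, 0), & \text{with probability } 1 - \rho, \\ (2d, \epsilon), & \text{otherwise,} \end{cases}$$

where $d$ is chosen in such a way that (35) is satisfied. Then, we can choose it as follows:

$$\rho (d + \epsilon) = \frac{1}{2} r_0 + (1 - \zeta) \xi_0 \quad \Longrightarrow \quad d = \frac{\frac{1}{2} r_0 + (1 - \zeta) \xi_0}{\rho} - \epsilon.$$

Now we define, for $k = 1, 2, 3, \dots$

$$(r_{k+1}, \xi_{k+1}) = \begin{cases} (r_k - 2 \zeta \epsilon, \epsilon), & \text{if } r_k \ge 2 \zeta \epsilon, \\ (0, 0), & \text{otherwise.} \end{cases}$$

Now it is easy to verify that for

$$K := \Big\lfloor \frac{d}{\zeta \epsilon} \Big\rfloor = \Big\lfloor \frac{1}{\zeta} \Big( \frac{\frac{1}{2} r_0 + (1 - \zeta) \xi_0}{\rho \epsilon} - 1 \Big) \Big\rfloor$$

we have that $\mathbf{P}(\xi_K \ge \epsilon) \ge \rho$.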
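The construction in Example 8 can be traced numerically. The sketch below follows the branch that occurs with probability $\rho$, starting from $(r_1, \xi_1) = (2d, \epsilon)$ with $d = (\frac{1}{2} r_0 + (1-\zeta)\xi_0)/\rho - \epsilon$, and applies the recursion $r_{k+1} = r_k - 2\zeta\epsilon$ while $r_k \ge 2\zeta\epsilon$. It reports the index $K = \lfloor d/(\zeta\epsilon) \rfloor$ and the value of $\xi_K$ on that branch, next to the bound $\frac{1}{\zeta}\big(\frac{\frac{1}{2} r_0 + \xi_0}{\rho\epsilon} - 1\big)$ that Lemma 7 requires. The parameter values are arbitrary and the function name is hypothetical.

```python
import math

def trace_bad_branch(r0, xi0, zeta, eps, rho):
    """Follow the probability-rho branch of the example up to iteration K."""
    d = (0.5 * r0 + (1.0 - zeta) * xi0) / rho - eps
    K = math.floor(d / (zeta * eps))
    r, xi = 2.0 * d, eps                  # state (r_1, xi_1) on the bad branch
    for _ in range(1, K):                 # transitions k -> k+1 for k = 1..K-1
        if r >= 2.0 * zeta * eps:
            r, xi = r - 2.0 * zeta * eps, eps
        else:
            r, xi = 0.0, 0.0
    return d, K, xi

def lemma7_bound(r0, xi0, zeta, eps, rho):
    """Right-hand side of the iteration bound in Lemma 7."""
    return (1.0 / zeta) * ((0.5 * r0 + xi0) / (rho * eps) - 1.0)

# Arbitrary illustrative parameters (chosen to be exactly representable).
d, K, xi_K = trace_bad_branch(r0=2.0, xi0=1.0, zeta=0.5, eps=0.25, rho=0.25)
K_lemma = lemma7_bound(r0=2.0, xi0=1.0, zeta=0.5, eps=0.25, rho=0.25)
```

With these numbers the bad branch still has $\xi_K = \epsilon$ after $K = 46$ iterations, while Lemma 7 asks for 62 iterations, so the two counts differ only by a modest constant factor.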

Corollary 9 (High probability result without bounded levelset). If we use Lemma 7 with
$r_k = \phi_r(x_k) = \|x_k - x^*\|_v^2$, $\xi_k = \phi_\xi(x_k) = F(x_k) - F^*$ and $\zeta = \alpha = \frac{\mathbf{E}[|\hat{S}|]}{n}$, then we obtain that for

$$K \ge \frac{n}{\mathbf{E}[|\hat{S}|]} \Big( \frac{\frac{1}{2} \|x_0 - x^*\|_v^2 + F(x_0) - F^*}{\rho \epsilon} - 1 \Big)$$

we have $\mathbf{P}(F(x_K) - F^* < \epsilon) \ge 1 - \rho$.

The negative aspect of Corollary 9 is the fact that one needs $O(\frac{1}{\rho})$ iterations, whereas classical
results under the bounded levelset assumption require only $O(\log \frac{1}{\rho})$ iterations.



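Multiple run strategy. Now we present a restarting strategy [25] which will give us a high
probability result $O(\log \frac{1}{\rho})$.

Lemma 10. Let $\{r_k\}_{k \ge 0}$ and $\{\xi_k\}_{k \ge 0}$ be the same as in Lemma 7. Assume that we
observe $r = \lceil \log \frac{1}{\rho} \rceil$ different random and independent realisations of this sequence always starting
from $x_0$, i.e. for any $k$ we have observed $x_k^1, x_k^2, \dots, x_k^r$. Then if

$$K \ge \frac{1}{\zeta} \Big( \frac{\frac{1}{2} r_0 + \xi_0}{(1/e) \epsilon} - 1 \Big)$$

then

$$\mathbf{P}\Big( \min_{l \in \{1, 2, \dots, r\}} \xi_K^l < \epsilon \Big) \ge 1 - \rho.$$

Proof. Because the realisations are independent, for any $l \in \{1, 2, \dots, r\}$ we have from Lemma 7
that $\mathbf{P}(\xi_K^l \ge \epsilon) \le \frac{1}{e}$. Hence

$$\mathbf{P}\Big( \min_{l} \xi_K^l \ge \epsilon \Big) = \mathbf{P}(\xi_K^1 \ge \epsilon, \xi_K^2 \ge \epsilon, \dots, \xi_K^r \ge \epsilon) = \prod_{l=1}^r \mathbf{P}(\xi_K^l \ge \epsilon) \le \Big( \frac{1}{e} \Big)^r \le \rho. \qquad □$$

Corollary 11. If we run PCDM $r = \lceil \log \frac{1}{\rho} \rceil$ many times for

$$K \ge \frac{n}{\mathbf{E}[|\hat{S}|]} \Big( \frac{\frac{1}{2} \|x_0 - x^*\|_v^2 + F(x_0) - F^*}{\epsilon (1/e)} - 1 \Big)$$

iterations each, then the best solution we get, indexed $l \in \{1, 2, \dots, r\}$, satisfies $\mathbf{P}(F(x_K^l) - F^* < \epsilon) \ge 1 - \rho$.
Hence, in total we need

$$\Big\lceil \frac{n}{\mathbf{E}[|\hat{S}|]} \Big( \frac{\frac{1}{2} \|x_0 - x^*\|_v^2 + F(x_0) - F^*}{\epsilon (1/e)} - 1 \Big) \Big\rceil \Big\lceil \log \frac{1}{\rho} \Big\rceil \sim O\Big( \log \frac{1}{\rho} \Big)$$

iterations of PCDM.

5.2 Case 2: Bounded levelset

The next result, Theorem 12, obtains the rate $O(\log \frac{1}{\rho})$ under the assumption that the levelset is
bounded. However, some results will hold only for a modified version of Algorithm 1. In particular,
we now present Algorithm 2.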


Algorithm 2 PCDM-M: Parallel Coordinate Descent Method [26]

1: choose initial point $x_0 \in \mathbb{R}^N$
2: for $k = 0, 1, 2, \dots$ do
3:   randomly choose set of blocks $S_k \subseteq \{1, \dots, n\}$
4:   for $i \in S_k$ (in parallel) do
5:     compute $h(x_k)^{(i)} = \arg\min_{t \in \mathbb{R}^{N_i}} \big\{ \langle (\nabla f(x_k))^{(i)}, t \rangle + \frac{v_i}{2} \|t\|_{(i)}^2 + \Psi_i(x_k^{(i)} + t) \big\}$
6:   end for
7:   if $F(x_k + \sum_{i \in S_k} U_i h(x_k)^{(i)}) \le F(x_0)$ then
8:     apply the update: $x_{k+1} \leftarrow x_k + \sum_{i \in S_k} U_i h(x_k)^{(i)}$
9:   else
10:    set $x_{k+1} \leftarrow x_k$
11:  end if
12: end for


Notice that the first 6 steps of Algorithm 2 are exactly the same as those of Algorithm 1.
However, Algorithm 2 forces the iterates to stay in $\mathcal{L}(x_0)$ (Steps 7-11).


Distance to the optimal solution set. In some of the results derived in this section we need
the distance to the optimal solution set, inside the levelset, to be finite, i.e.

$$\mathcal{R}_{v,0} := \max_{x \in \mathcal{L}(x_0)} \Big\{ \max_{x^* \in X^*} \|x - x^*\|_v \Big\} < \infty. \tag{39}$$

Note that for any $x^* \in X^*$ (where $X^*$ is the set of optimal solutions) it trivially holds that $\|x_0 - x^*\|_v \le \mathcal{R}_{v,0}$.
Moreover, for some problems the levelset can be unbounded, in which case $\mathcal{R}_{v,0}$ is infinite,
whereas if $X^* \neq \emptyset$ then $\|x_0 - x^*\|_v$ is always finite.

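Theorem 12. Let $\{x_k\}_{k \ge 0}$ be a sequence of iterates generated by

• PCDM (Algorithm 1), if $F$ is strongly convex with $\mu_f + \mu_\Psi > 0$ or $F$ is convex and a
monotonic ESO is used,

• PCDM-M (Algorithm 2), if $F$ is convex and a non-monotonic ESO is used.

Let $0 < \epsilon < F(x_0) - F^*$ and $\rho \in (0, 1)$ be chosen arbitrarily. Define $\alpha = \frac{\mathbf{E}[|\hat{S}|]}{n}$ and let

$$c := \max\{\mathcal{R}_{v,0}^2, F(x_0) - F^*\}. \tag{40}$$

Then

(i) if $F$ is convex and we choose

$$K \ge \frac{2c}{\alpha \epsilon} \Big( 1 + \log \Big( \frac{\frac{1}{2} \|x_0 - x^*\|_v^2 + F(x_0) - F^*}{2 c \rho} \Big) \Big) + 2 - \frac{1}{\alpha}, \tag{41}$$

(ii) or if $F$ is strongly convex with $\mu_f + \mu_\Psi > 0$ and we choose

$$K \ge \frac{1 + \mu_f + 2\mu_\Psi}{2 \alpha (\mu_f + \mu_\Psi)} \log \Big( \frac{\frac{1 + \mu_\Psi}{2} \|x_0 - x^*\|_v^2 + F(x_0) - F^*}{\rho \epsilon} \Big), \tag{42}$$

then

$$\mathbf{P}(F(x_K) - F^* \le \epsilon) \ge 1 - \rho. \tag{43}$$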

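Proof. The proof proceeds as in [25, Theorem 1]. For convenience, let $\xi_k := F(x_k) - F^*$ and define

$$\xi_k^\epsilon := \begin{cases} \xi_k, & \text{if } \xi_k \ge \epsilon, \\ 0, & \text{otherwise.} \end{cases}$$

Notice that $\xi_k^\epsilon \le \epsilon \Leftrightarrow \xi_k \le \epsilon$, $k \ge 0$. Using the Markov inequality,

$$\mathbf{P}(F(x_K) - F^* > \epsilon) = \mathbf{P}(\xi_K > \epsilon) = \mathbf{P}(\xi_K^\epsilon > \epsilon) \le \frac{1}{\epsilon} \mathbf{E}[\xi_K^\epsilon], \tag{44}$$

so it suffices to find $K$ such that

$$\mathbf{E}[\xi_K^\epsilon] \le \epsilon \rho. \tag{45}$$

Using an ESO and Lemma 17 in [26] will give us

$$\mathbf{E}[\xi_{k+1} \mid x_k] \le \Big( 1 - \frac{\alpha \xi_k}{2c} \Big) \xi_k. \tag{46}$$

It is easy to verify that (46) and the definition of $\xi_k^\epsilon$ lead to (see the proof of [25, Theorem 1])

$$\mathbf{E}[\xi_{k+1}^\epsilon \mid x_k] \le \Big( 1 - \frac{\alpha \epsilon}{2c} \Big) \xi_k^\epsilon, \qquad \forall k \ge 0.$$

Taking expectation with respect to $x_k$ on both sides of the above we get

$$\mathbf{E}[\xi_{k+1}^\epsilon] \le \Big( 1 - \frac{\alpha \epsilon}{2c} \Big) \mathbf{E}[\xi_k^\epsilon], \qquad \forall k \ge 0. \tag{47}$$

In addition, using (16) and the relation $\xi_k^\epsilon \le \xi_k$, we have

$$\mathbf{E}[\xi_k^\epsilon] \le \frac{1}{1 + \alpha k} \Big( \frac{1}{2} \|x_0 - x^*\|_v^2 + \xi_0 \Big), \qquad \forall k \ge 0. \tag{48}$$

Now for any $t > 0$, let

$$K_1 = \Big\lceil \frac{1}{\alpha} \Big( \frac{\frac{1}{2} \|x_0 - x^*\|_v^2 + \xi_0}{t \epsilon} - 1 \Big) \Big\rceil, \qquad K_2 = \Big\lceil \frac{2c}{\alpha \epsilon} \log \Big( \frac{t}{\rho} \Big) \Big\rceil. \tag{49}$$

It follows from (48) that $\mathbf{E}[\xi_{K_1}^\epsilon] \le t \epsilon$, which together with (47) implies that

$$\mathbf{E}[\xi_{K_1 + K_2}^\epsilon] \le \Big( 1 - \frac{\alpha \epsilon}{2c} \Big)^{K_2} \mathbf{E}[\xi_{K_1}^\epsilon] \le \Big( 1 - \frac{\alpha \epsilon}{2c} \Big)^{K_2} t \epsilon \le \rho \epsilon. \tag{50}$$

Notice that, by (47), the sequence $\mathbf{E}[\xi_k^\epsilon]$ is decreasing. Hence, we have

$$\mathbf{E}[\xi_k^\epsilon] \le \rho \epsilon, \qquad \forall k \ge K(t), \tag{51}$$

where

$$K(t) := \frac{1}{\alpha} \Big( \frac{\frac{1}{2} \|x_0 - x^*\|_v^2 + F(x_0) - F^*}{t \epsilon} - 1 \Big) + \frac{2c}{\alpha \epsilon} \log \Big( \frac{t}{\rho} \Big) + 2. \tag{52}$$

It is easy to verify that

$$t^* \Big( = \arg\min_{t > 0} K(t) \Big) := \frac{1}{2c} \Big( \frac{1}{2} \|x_0 - x^*\|_v^2 + F(x_0) - F^* \Big). \tag{53}$$

Because $K \ge K(t^*)$ we see that (45) holds and the proof of (i) is complete.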

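Now we prove (ii). From (33), we have

$$\mathbf{E}\Big[ \frac{1 + \mu_\Psi}{2} \|x_{k+1} - x^*\|_v^2 + \xi_{k+1} \,\Big|\, x_k \Big] \le (1 - \alpha \gamma) \Big( \frac{1 + \mu_\Psi}{2} \|x_k - x^*\|_v^2 + \xi_k \Big), \tag{54}$$

where $0 \le \gamma \le 1$ is defined in (32). Taking expectation in $x_k$ (and using recursion) gives
$\mathbf{E}[\xi_K] \le (1 - \alpha \gamma)^K \big( \frac{1 + \mu_\Psi}{2} \|x_0 - x^*\|_v^2 + \xi_0 \big)$. Finally, using the Markov inequality, and $K$ given in (42),
we have

$$\mathbf{P}(\xi_K > \epsilon) \le \frac{1}{\epsilon} \mathbf{E}[\xi_K] \le \frac{1}{\epsilon} (1 - \alpha \gamma)^K \Big( \frac{1 + \mu_\Psi}{2} \|x_0 - x^*\|_v^2 + \xi_0 \Big) \le \rho, \tag{55}$$

and the result follows. □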

In this Section we have presented three new convergence results for PCDM. The first result 
shows that, using the analysis in [3^, PCDM obtains a O(^) rate when the levelset is unbounded 
for a single run strategy. The second result shows that PCDM obtains an $O(\log(1/\rho))$ rate for a
restarting strategy.

On the other hand, if the levelset is bounded, we have shown that PCDM achieves a rate of
$O(\log(1/\rho))$. It is still an open problem to determine whether PCDM can achieve a rate of $O(\log(1/\rho))$
for a single run strategy when the levelset is unbounded.

6 Discussion 

6.1 Comparison of the convergence rate results 

We have the following remarks on comparing the results in Theorem [3] with those in [26]. 

6.1.1 Comparison in the convex case 

For problem (1), an expected-value type of convergence rate is not presented explicitly in [25],
although it can be derived from the following relation (that is stated in [26] and proved in [25,
Theorem 1]):

$$\mathbf{E}[F(x_{k+1}) - F^* \mid x_k] \le (F(x_k) - F^*) - \frac{\alpha\,(F(x_k) - F^*)^2}{2c}, \quad \forall k \ge 0, \tag{56}$$

where $c$ is defined in (40). Taking expectation on both sides of (56) and using a similar argument
as that in [18], gives

$$\mathbf{E}[F(x_k) - F^* \mid x_{k-1}] \le \frac{2c\,(F(x_0) - F^*)}{2c + \alpha k\,(F(x_0) - F^*)}, \quad \forall k \ge 0. \tag{57}$$
Let $a$ and $b$ denote the right hand side of (16) and (57) respectively. By the definition of $c$ and the
relation $\|x_0 - x^*\|_v \le \mathcal{R}_{v,0}$, we see that when $k$ is sufficiently large,

$$\frac{b}{a} \approx \frac{4c}{\|x_0 - x^*\|_v^2 + 2(F(x_0) - F^*)} \ge \frac{4}{3}. \tag{58}$$


6.1.2 Comparison in the strongly convex case 

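For the special case of (1) where at least one of $f$ and $\Psi$ is strongly convex (i.e., $\mu_f + \mu_\Psi > 0$),
Richtárik and Takáč [26] showed that for all $k \ge 0$, there holds

$$\mathbf{E}[F(x_k) - F^* \mid x_{k-1}] \le \left(1 - \frac{\alpha\,(\mu_f + \mu_\Psi)}{1 + \mu_\Psi}\right)^k (F(x_0) - F^*). \tag{59}$$

It is not hard to observe that

$$\frac{2(\mu_f + \mu_\Psi)}{1 + \mu_f + 2\mu_\Psi} \ge \frac{\mu_f + \mu_\Psi}{1 + \mu_\Psi}.$$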
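The comparison of the two contraction factors reduces to a one-line cross-multiplication. The derivation below is a sketch that assumes $\mu_f + \mu_\Psi > 0$ and $\mu_\Psi \ge 0$, so that both denominators are positive and the common factor $\mu_f + \mu_\Psi$ can be cancelled; it shows that the inequality is equivalent to $\mu_f \le 1$, the condition invoked again in Section 6.2.

```latex
\begin{align*}
\frac{2(\mu_f+\mu_\Psi)}{1+\mu_f+2\mu_\Psi} \ge \frac{\mu_f+\mu_\Psi}{1+\mu_\Psi}
&\iff 2(1+\mu_\Psi) \ge 1+\mu_f+2\mu_\Psi \\
&\iff \mu_f \le 1 .
\end{align*}
```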

Recall that $\gamma$ is defined in (32). Then it follows that for sufficiently large $k$ one has

$$(1 - \alpha\gamma)^k\left(\tfrac12\|x_0 - x^*\|_v^2 + F(x_0) - F^*\right) \le \left(1 - \frac{\alpha\,(\mu_f + \mu_\Psi)}{1 + \mu_\Psi}\right)^k (F(x_0) - F^*).$$

6.2 Comparison of the iteration complexity results 

Here we compare the results in Theorem [12] with those in [26]. 


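Comparison in the convex case. For any $0 < \epsilon < F(x_0) - F^*$ and $\rho \in (0,1)$, Richtárik and
Takáč [26] showed that (43) holds for all $k \ge \tilde K$, where

$$\tilde K := \frac{2c}{\alpha\epsilon}\left(1 + \log\left(\frac{1}{\rho}\right)\right) + 2 - \frac{2c}{\alpha\,(F(x_0) - F^*)}. \tag{61}$$

Using the definition of $c$ and the fact that $\|x_0 - x^*\|_v \le \mathcal{R}_{v,0}$, we observe that

$$r := \frac{\|x_0 - x^*\|_v^2 + 2\xi_0}{4c} \le \frac{3}{4}. \tag{62}$$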


By the definitions of $K$ and $\tilde K$ we have that for sufficiently small $\epsilon > 0$,

$$K - \tilde K \approx \frac{2c\log r}{\alpha\epsilon} \le -\frac{2c\log(4/3)}{\alpha\epsilon}. \tag{63}$$

In addition, $\|x_0 - x^*\|_v$ can be much smaller than $\mathcal{R}_{v,0}$ and thus $r$ can be very small. It follows
from the above that $K$ can be significantly smaller than $\tilde K$.
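The size of the gap between the two iteration counts can be illustrated with a few lines of arithmetic. This is a hedged sketch: it assumes the count of this paper has the form $(2c/(\alpha\epsilon))(1 + \log(X/(2c\rho)))$ with $X = \tfrac12\|x_0 - x^*\|_v^2 + \xi_0$ (integer rounding and additive constants ignored) and that the count from [26] is the expression labelled (61). The function names and all numerical values are hypothetical.

```python
import math


def k_this_paper(dist_sq, xi0, c, alpha, eps, rho):
    """Assumed count (2c/(alpha*eps)) * (1 + log(X/(2*c*rho))), X = dist_sq/2 + xi0."""
    big_x = 0.5 * dist_sq + xi0
    return (2.0 * c / (alpha * eps)) * (1.0 + math.log(big_x / (2.0 * c * rho)))


def k_richtarik_takac(xi0, c, alpha, eps, rho):
    """Count of the form (2c/(alpha*eps)) * (1 + log(1/rho)) + 2 - 2c/(alpha*xi0)."""
    return (2.0 * c / (alpha * eps)) * (1.0 + math.log(1.0 / rho)) + 2.0 - 2.0 * c / (alpha * xi0)


def ratio_r(dist_sq, xi0, c):
    """The ratio r = (dist_sq + 2*xi0) / (4c), which equals X/(2c)."""
    return (dist_sq + 2.0 * xi0) / (4.0 * c)


# Hypothetical constants with c >= max(dist_sq, xi0), as required for r <= 3/4.
DIST_SQ, XI0, C, ALPHA, EPS, RHO = 2.0, 2.0, 4.0, 0.1, 1e-3, 0.05

R = ratio_r(DIST_SQ, XI0, C)
GAP = k_this_paper(DIST_SQ, XI0, C, ALPHA, EPS, RHO) - k_richtarik_takac(XI0, C, ALPHA, EPS, RHO)
LEADING_TERM = (2.0 * C / (ALPHA * EPS)) * math.log(R)
```

Under these assumptions the exact gap differs from the leading term $(2c/(\alpha\epsilon))\log r$ only by the $\epsilon$-independent constant $-2 + 2c/(\alpha\xi_0)$, which is why the approximation is stated for sufficiently small $\epsilon$.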
Comparison in the strongly convex case. In the strongly convex case (i.e., $\mu_f(w) + \mu_\Psi(w) > 0$), Richtárik and Takáč showed that (43) holds for all $k \ge \tilde K$ where

$$\tilde K := \frac{1 + \mu_\Psi(w)}{\alpha\,(\mu_f(w) + \mu_\Psi(w))}\log\left(\frac{F(x_0) - F^*}{\epsilon\rho}\right).$$

We can see that for $\rho$ or $\epsilon$ sufficiently small we have

$$\frac{K}{\tilde K} \approx \frac{1 + \mu_f(w) + 2\mu_\Psi(w)}{2(1 + \mu_\Psi(w))} \le 1, \tag{64}$$

because $\mu_f \le 1$, which demonstrates that $K$ is smaller than $\tilde K$.



7 Numerical experiments 

In this Section we present preliminary computational results. The purpose of these experiments
is to provide a numerical comparison of the performance of PCDM, under the different ESOs
summarized in Appendix A.2.

Least squares. Consider the following convex optimization problem $\min_x \tfrac12\|Ax - b\|_2^2$, where $A \in \mathbb{R}^{8\cdot10^3 \times 2\cdot10^3}$. Each row has between 1 and $\omega = 20$ nonzero elements (uniformly at random). For simplicity, we normalize (in $\ell_2$ norm) all the columns of $A$. The value of $\sigma = \lambda_{\max}(A^TA) = 10.48$. We have compared 5 different approaches which are given in Table 3. Parameter $\tau = 512$ and hence $1 + \frac{(\omega-1)(\tau-1)}{n-1} = 5.856$ for RT-P and $1 + \frac{(\sigma-1)(\tau-1)}{n-1} = 3.424$ for RT-D approach.

Table 3: Approaches used in the numerical experiments.

Name | $v$ | Note
BKBG | $v^{\mathrm{BKBG}} = L$ | This is the naive approach, which was proposed in [1] and [22]. Note that this is not an ESO.
RT-P | $v^{\mathrm{RT\text{-}P}} = \left(1 + \frac{(\omega-1)(\tau-1)}{n-1}\right)L$ | Theorem 18, originally derived in [26].
RT-D | $v^{\mathrm{RT\text{-}D}} = \left(1 + \frac{(\sigma-1)(\tau-1)}{n-1}\right)L$ | Derived in [23] as a special case for $C = 1$.
FR | $v^{\mathrm{FR}}$ as given in (73) | Theorem 20, proposed in [3] and generalized in this paper (Theorem 19).
NC | $v^{\mathrm{NC}} = \sum_{J \in \mathcal{J}} L_J e_{[J]}$ | Theorem 21, proposed in [15].

The distribution of vectors $v$ can be found in Figure 1 (right). Figure 1 shows the evolution of $F(x_k) - F^*$ for all
5 methods. Note that the BKBG did not converge. The speed of RT-P, RT-D and FR is quite
similar and NC is approximately 3 times worse because of its larger ESO parameter $v^{\mathrm{NC}}$.

Figure 1: Evolution of $F(x_k) - F^*$ for 5 different methods (left) and distribution of $v$ (right).

SVM dual. In this experiment we compare 4 methods from Table 3 (we have excluded the naive
approach because it usually diverges for large $\tau$) on a real-world dataset astro-ph, which consists of
data from papers in physics [29]. This dataset has 29,882 training samples and a total of 99,757
features. This dataset is very sparse. Indeed, each sample uses on average only 77.317 features and
each sample belongs to one of two classes. Hence, one might be interested in finding a hyperplane
that separates the samples into their corresponding classes. The optimization problem can be
formulated as follows:

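$$\min_{w \in \mathbb{R}^m} P(w) := \frac{\lambda}{2}\|w\|_2^2 + \frac{1}{n}\sum_{i=1}^n \max\{0,\, 1 - y_{(i)}\, a_{(i)}^T w\}, \tag{65}$$

where $y_{(i)} \in \{-1, 1\}$ is the label of the class to which sample $a_{(i)} \in \mathbb{R}^m$ belongs.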

While problem formulation (65) does not fit our framework (the nonsmooth part is nonseparable) the dual formulation (see [5, 30, 31]) does:

$$\max_{x \in [0,1]^n} D(x) := \frac{1}{n}\mathbf{1}^T x - \frac{1}{2\lambda n^2}\, x^T Q x, \tag{66}$$

where $Q \in \mathbb{R}^{n \times n}$, $Q_{i,j} = y_{(i)} y_{(j)} \langle a_{(i)}, a_{(j)} \rangle$. In particular, problem formulation (66) is the sum of
a smooth term, and the restriction $x \in [0,1]^n$ can be formulated as a (block separable) indicator
function. In this dataset, each sample is normalized, hence $L = (1, \dots, 1)^T$.

For any dual feasible point $x$ we can obtain a primal feasible point $w(x) = \frac{1}{\lambda n}\sum_{i=1}^n x_i\, y_{(i)}\, a_{(i)}$.
Moreover, from strong duality we know that if $x^*$ is an optimal solution of (66), then $w^* = w(x^*)$ is
optimal for problem (65). Therefore, we can associate a gap $G(x) = P(w(x)) - D(x)$ to each feasible
point $x$, which measures the distance of the objective value from optimality. Clearly $G(x^*) = 0$.

Figure 2 (left) shows the evolution of $G(x_k)$ as the iterates progress, and the distribution of
ESO parameter $v$ for different choice of $\tau \in \{32, 256\}$. Naturally, as $\tau$ increases, the distributions
of the $\tau$-dependent ESO parameters shift to the right, whereas the distribution of $v^{\mathrm{NC}}$ is not influenced by changing $\tau$. The value
of important parameters for other methods are $\sigma = 287.273$ and $\omega = 29881$. For $\tau = 32$ we have
$1 + \frac{(\omega-1)(\tau-1)}{n-1} = 31.998$ for RT-P and $1 + \frac{(\sigma-1)(\tau-1)}{n-1} = 1.296$ for RT-D approach and for $\tau = 256$ we
have $1 + \frac{(\omega-1)(\tau-1)}{n-1} = 255.991$ for RT-P and $1 + \frac{(\sigma-1)(\tau-1)}{n-1} = 3.443$ for RT-D approach. Again the
best performance is given by RT-D which requires knowledge of $\sigma$. If we do not want to estimate
parameter $\sigma$ then we should use FR. It was shown in [3] that for a quadratic objective function FR
is always better than RT-P.

Figure 2: Comparison of evolution of $G(x_k)$ for various methods and the distribution of $v$.
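The scalar multipliers quoted in both experiments can be recomputed from the problem dimensions. The sketch below assumes the RT-P multiplier is $1 + (\omega-1)(\tau-1)/\max\{1, n-1\}$ (Theorem 18) and that the RT-D multiplier has the analogous form with $\sigma$ in place of $\omega$; it further assumes $n = 2\cdot10^3$ (the number of columns of $A$) for least squares and $n = 29{,}882$ (one dual variable per training sample) for the SVM dual. The function names are hypothetical. Under these assumptions the results agree with the quoted values to within about $10^{-3}$.

```python
def rt_p_factor(omega, tau, n):
    """Multiplier 1 + (omega - 1)(tau - 1) / max(1, n - 1) applied to L in RT-P."""
    return 1.0 + (omega - 1) * (tau - 1) / max(1, n - 1)


def rt_d_factor(sigma, tau, n):
    """Assumed RT-D multiplier: same form as RT-P with sigma replacing omega."""
    return 1.0 + (sigma - 1) * (tau - 1) / max(1, n - 1)


# Least squares experiment: omega = 20, sigma = 10.48, tau = 512, n = 2000.
LEAST_SQUARES = {
    "RT-P": rt_p_factor(20, 512, 2000),
    "RT-D": rt_d_factor(10.48, 512, 2000),
}

# SVM dual experiment: omega = 29881, sigma = 287.273, n = 29882, tau in {32, 256}.
SVM_DUAL = {
    tau: {
        "RT-P": rt_p_factor(29881, tau, 29882),
        "RT-D": rt_d_factor(287.273, tau, 29882),
    }
    for tau in (32, 256)
}
```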
A Expected Separable Overapproximation

A.1 Smoothness assumptions

In this work we assume that the function $f$ is partially separable and smooth, and the purpose of
this section is to define these two concepts. We begin with the definition of partial separability for
a smooth convex function, introduced by Richtárik and Takáč in [26].

Definition 13 (Partial separability [26]). A smooth convex function $f : \mathbb{R}^N \to \mathbb{R}$ is partially
separable of degree $\omega$ if there exists a collection $\mathcal{J}$ of subsets of $\{1, 2, \dots, n\}$ such that

$$f(x) = \sum_{J \in \mathcal{J}} f_J(x) \quad \text{and} \quad \max_{J \in \mathcal{J}} |J| \le \omega, \tag{67}$$

where for each $J$, $f_J$ is a smooth convex function that depends on $x^{(i)}$ for $i \in J$ only.

Now we introduce different types of smoothness assumptions for the function $f$. Each smoothness
type gives rise to a different ESO. Note that all of the following smoothness assumptions are
equivalent. That is, if a given function satisfies one of the assumptions, then there exist constants
such that the other assumptions also hold.

The first type of assumption is a classical assumption in the literature [22, 25, 26].

Assumption 14 ((Block) Coordinate-wise Lipschitz continuous gradient). The gradient of $f$ is
block Lipschitz, uniformly in $x$, with positive constants $L_1, \dots, L_n$. That is, for all $x \in \mathbb{R}^N$,
$i = 1, \dots, n$ and $h \in \mathbb{R}^{N_i}$ we have

$$\|(\nabla f(x + U_i h))^{(i)} - (\nabla f(x))^{(i)}\|_{(i)}^* \le L_i \|h\|_{(i)}, \tag{68}$$

where $\nabla f(x)$ denotes the gradient of $f$ and

$$(\nabla f(x))^{(i)} = U_i^T \nabla f(x) \in \mathbb{R}^{N_i}. \tag{69}$$

The second type of assumption we make is that each function in the sum (67) has a Lipschitz
continuous gradient. Such an assumption is made, for example, in [ZllHlIisl 115] . Moreover, we 
allow each function to have Lipschitz continuous gradient with a different constant (which was also
assumed in [15]).

Assumption 15 (Lipschitz continuous gradient of sub-functions). Each function $f_J$, $J \in \mathcal{J}$, has
a Lipschitz continuous gradient, uniformly in $x$, with positive constant $L_J$ with respect to some
Euclidean norm $\|\cdot\|_{(J)}$. That is, for all $x \in \mathbb{R}^N$, $J \in \mathcal{J}$ and $h \in \mathbb{R}^N$ we have

$$\|\nabla f_J(x + h) - \nabla f_J(x)\|_{(J)}^* \le L_J \|h\|_{(J)}. \tag{70}$$
Note that this smoothness assumption is more general than that made in m because of the
possibility of choosing general norms of the form $\|\cdot\|_{(J)}$. Further, Assumption 15 generalizes the
smoothness assumptions imposed in mm- 

The third type of assumption we make is that each function in the sum (67) has coordinate-wise
Lipschitz continuous gradient.

Assumption 16 ((Block) Coordinate-wise Lipschitz continuous gradient of sub-functions). The
gradient of $f_J$, $J \in \mathcal{J}$, is block Lipschitz, uniformly in $x$, with non-negative constants $L_{J,1}, \dots, L_{J,n}$.
That is, for all $x \in \mathbb{R}^N$, $i = 1, \dots, n$, $J \in \mathcal{J}$ and $h \in \mathbb{R}^{N_i}$ we have

$$\|(\nabla f_J(x + U_i h))^{(i)} - (\nabla f_J(x))^{(i)}\|_{(i)}^* \le L_{J,i} \|h\|_{(i)}. \tag{71}$$

One can think of Assumptions 14 and 15 as being ‘opposite’ to each other in the following
sense. If we associate the block coordinates with the columns, and the functions with the rows,
we see that Assumption 14 captures the dependence column-wise, while Assumption 15 captures
the dependence row-wise. Hence, Assumption 16 can be thought of as an element-wise smoothness
assumption.

To make this more concrete, we present an example that demonstrates how to compute the 
Lipschitz constants for a quadratic function, under each of the three smoothness assumptions 
stated above. 

Example 17. Let the function $f(x) = \tfrac12\|Ax - b\|_2^2 = \tfrac12\sum_{j=1}^m \left(\sum_{i=1}^n a_{j,i}\, x^{(i)} - b^{(j)}\right)^2$, where $A \in \mathbb{R}^{m \times n}$
and $a_{j,i}$ is the $(j,i)$th element of the matrix $A$. Let us fix all the norms $\|\cdot\|_{(J)}$ from Assumption 15
to be standard Euclidean norms. Then one can easily verify that equations (68), (70) and (71) are
satisfied with the following choice of constants

$$L_i = \sum_{j=1}^m a_{j,i}^2, \qquad L_J = \sum_{i=1}^n a_{j,i}^2, \qquad L_{J,i} = a_{j,i}^2.$$

In words, $L_i$ is equal to the square of the $\ell_2$ norm of the $i$th column, $L_J$ is equal to the square of the $\ell_2$
norm of the $j$th row and $L_{J,i}$ is simply the square of the $(j,i)$th element of the matrix $A$.
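The three families of constants in Example 17 are simple to compute for a concrete matrix. The following pure-Python sketch is illustrative: the function name `lipschitz_constants` and the small matrix `A_DEMO` are hypothetical, and each row $j$ of the matrix is identified with one sub-function $f_J$.

```python
def lipschitz_constants(matrix):
    """Return (L_col, L_row, L_elem) for f(x) = 0.5 * ||Ax - b||^2.

    L_col[i]     : squared l2 norm of column i (constants of Assumption 14)
    L_row[j]     : squared l2 norm of row j    (constants of Assumption 15)
    L_elem[j][i] : squared entry a_{j,i}       (constants of Assumption 16)
    """
    m, n = len(matrix), len(matrix[0])
    l_col = [sum(matrix[j][i] ** 2 for j in range(m)) for i in range(n)]
    l_row = [sum(matrix[j][i] ** 2 for i in range(n)) for j in range(m)]
    l_elem = [[matrix[j][i] ** 2 for i in range(n)] for j in range(m)]
    return l_col, l_row, l_elem


# Hypothetical 2 x 2 example.
A_DEMO = [[1.0, 2.0],
          [3.0, 4.0]]
L_COL, L_ROW, L_ELEM = lipschitz_constants(A_DEMO)
```

Summing `L_ELEM` over rows recovers `L_COL`, and summing it over columns recovers `L_ROW`, which matches the column-wise, row-wise and element-wise interpretation given above.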

One could be misled into believing that Assumption 16 is the best because it is the most
restrictive. However, while this is true for the quadratic objective shown in Example 17, for
a general convex function, Assumption 16 can give Lipschitz constants that lead to worse ESO
bounds (see Example 22 for further details).


A.2 Expected Separable Overapproximation (ESO) 

Now, it is clear that the update $h$ in Algorithm 1 depends on the ESO parameter $v$. This shows that
the ESO is not just a technical tool; the parameters are actually used in Algorithm 1. Therefore
we must be able to obtain/compute these parameters easily. We now present the following three
theorems, namely Theorems 18, 20 and 21, that explain how to obtain the $v$ parameter for a $\tau$-nice
sampling, under different smoothness assumptions.

Theorem 18 (ESO for a $\tau$-nice sampling, Theorem 14 in [26]). Let Assumption 14 hold with
constants $L_1, \dots, L_n$ and let $\hat S$ be a $\tau$-nice sampling. Then $f : \mathbb{R}^N \to \mathbb{R}$ admits an ESO with
respect to the sampling $\hat S$ with parameter

$$v = \left(1 + \frac{(\omega - 1)(\tau - 1)}{\max\{1, n - 1\}}\right) L,$$

where $L = (L_1, \dots, L_n)^T$.

The obvious disadvantage of Theorem 18 is the fact that $v$ in the ESO depends on $\omega$. (When
$\omega$ is large, so too is $v$.) One can imagine a situation in which $\omega$ is much larger than the average
cardinality of $J \in \mathcal{J}$, resulting in a large $v$. For example, if $|J|$ for $J \in \mathcal{J}$ is small for all but one
function.

With this in mind, we introduce a new theorem that shows how the ESO in Theorem 18 can
be modified if we know that Assumption 16 holds. In this case, the role of $\omega$ is slightly suppressed.

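Theorem 19 (ESO for a doubly uniform sampling). Let Assumption 16 hold with constants
$L_{J,i}$, $J \in \mathcal{J}$, $i \in \{1, \dots, n\}$ and let $\hat S$ be a doubly uniform sampling. Then $f : \mathbb{R}^N \to \mathbb{R}$ admits an
ESO with respect to the sampling $\hat S$ with parameter

$$v = \sum_{J \in \mathcal{J}} \left(1 + \frac{\left(\frac{\mathbf{E}[|\hat S|^2]}{\mathbf{E}[|\hat S|]} - 1\right)(|J| - 1)}{\max\{1, n - 1\}}\right) (L_{J,1}, \dots, L_{J,n})^T. \tag{72}$$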


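Proof. From Theorem 15 in [26] we know that for each function $f_J$, $J \in \mathcal{J}$, we have

$$(f_J, \hat S) \sim \mathrm{ESO}\left(\left(1 + \frac{\left(\frac{\mathbf{E}[|\hat S|^2]}{\mathbf{E}[|\hat S|]} - 1\right)(|J| - 1)}{\max\{1, n - 1\}}\right)(L_{J,1}, \dots, L_{J,n})^T\right).$$

Now, using Theorem 10 in [26], which deals with conic combinations of functions, we have

$$\left(\sum_{J \in \mathcal{J}} f_J,\ \hat S\right) \sim \mathrm{ESO}\left(\sum_{J \in \mathcal{J}} \left(1 + \frac{\left(\frac{\mathbf{E}[|\hat S|^2]}{\mathbf{E}[|\hat S|]} - 1\right)(|J| - 1)}{\max\{1, n - 1\}}\right)(L_{J,1}, \dots, L_{J,n})^T\right). \qquad \square$$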


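The following Theorem is a special case of Theorem 19 for a $\tau$-nice sampling.

Theorem 20 (ESO for a $\tau$-nice sampling, Theorem 1 in [3]). Let Assumption 16 hold with constants
$L_{J,i}$, $J \in \mathcal{J}$, $i \in \{1, \dots, n\}$ and let $\hat S$ be a $\tau$-nice sampling. Then $f : \mathbb{R}^N \to \mathbb{R}$ admits an ESO with
respect to the sampling $\hat S$ with parameter

$$v = \sum_{J \in \mathcal{J}} \left(1 + \frac{(\tau - 1)(|J| - 1)}{\max\{1, n - 1\}}\right)(L_{J,1}, \dots, L_{J,n})^T. \tag{73}$$

Proof. Notice that, if $\hat S$ is a $\tau$-nice sampling, then $\mathbf{E}[|\hat S|] = \tau$ and $\mathbf{E}[|\hat S|^2] = \tau^2$, so that $\mathbf{E}[|\hat S|^2]/\mathbf{E}[|\hat S|] = \tau$, and the result follows
from Theorem 19. $\square$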
The following theorem explains how to compute an ESO if Assumption 15 holds. This ESO
was proposed and proved in [15].

Theorem 21 (ESO for a $\tau$-nice sampling, Lemma 1 in [15]). Let Assumption 15 hold with constants
$L_J$, $J \in \mathcal{J}$ and let $\hat S$ be a $\tau$-nice sampling. Then $f : \mathbb{R}^N \to \mathbb{R}$ admits an ESO with respect to the
sampling $\hat S$ with parameter

$$v = \sum_{J \in \mathcal{J}} L_J e_{[J]},$$

where $e = (1, \dots, 1)^T \in \mathbb{R}^n$. Moreover, this ESO is monotonic.
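To make the three $\tau$-nice ESO parameters concrete, the sketch below evaluates them for the quadratic of Example 17, where each row $j$ of $A$ defines one sub-function whose index set $J$ is the support of that row. It is an illustration under stated assumptions, not the authors' code: $e_{[J]}$ is taken to be the 0/1 indicator vector of $J$, blocks are single coordinates, and the names `eso_parameters` and `A_SPARSE` are hypothetical.

```python
def eso_parameters(matrix, tau):
    """ESO vectors of Theorems 18, 20 and 21 for f(x) = 0.5 * ||Ax - b||^2."""
    m, n = len(matrix), len(matrix[0])
    denom = max(1, n - 1)
    # Index set J of each sub-function: the nonzero pattern of the row.
    supports = [[i for i in range(n) if matrix[j][i] != 0] for j in range(m)]
    omega = max(len(support) for support in supports)

    # Theorem 18: one scalar multiplier applied to L = (L_1, ..., L_n).
    l_col = [sum(matrix[j][i] ** 2 for j in range(m)) for i in range(n)]
    v18 = [(1 + (omega - 1) * (tau - 1) / denom) * l_i for l_i in l_col]

    v20 = [0.0] * n  # Theorem 20: a multiplier per sub-function, using |J|
    v21 = [0.0] * n  # Theorem 21: L_J spread over the coordinates in J
    for j, support in enumerate(supports):
        factor = 1 + (tau - 1) * (len(support) - 1) / denom
        l_row = sum(matrix[j][i] ** 2 for i in support)
        for i in support:
            v20[i] += factor * matrix[j][i] ** 2
            v21[i] += l_row
    return v18, v20, v21


# Hypothetical sparse 3 x 3 matrix with row supports {0, 2}, {1}, {0, 1, 2}.
A_SPARSE = [[1.0, 0.0, 2.0],
            [0.0, 3.0, 0.0],
            [1.0, 1.0, 1.0]]
V18, V20, V21 = eso_parameters(A_SPARSE, tau=2)
```

For this matrix the Theorem 20 vector is componentwise no larger than the Theorem 18 vector, because every $|J|$ is at most $\omega$; the Theorem 21 vector does not depend on $\tau$ at all.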

As it is shown in Theorem 3, the speed of the algorithm (number of iterations needed to solve
the problem) depends on ESO parameter $v$ via the term $\|x_0 - x^*\|_v^2$. Moreover, for a given objective
function and sampling $\hat S$, there may be more than one ESO that could be chosen. Suppose that
we have two ESOs to choose from, characterized by two parameters $v_a$ and $v_b$ respectively, and let
$v_a < v_b$. In this case, the ESO characterized by $v_a$ will give us (theoretically) faster convergence,
and so it is obvious that this ESO should be used. Furthermore, $v_a$ is used as a parameter in
Algorithm 1 and so, intuitively, this faster theoretical convergence is expected to lead to fast
practical performance.

In Section 6.1 in [3] it was shown that for a quadratic objective, the ESO in Theorem 18 is
always worse than the ESO from Theorem 20. However, for a general objective the opposite can
be true. The following simple example shows that the ESO from Theorems 20 and 21 can be $m$
times worse than the ESO from Theorem 18.

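Example 22. Consider the function

$$f(x) = \sum_{j=1}^m \underbrace{\log\left(1 + e^{-x + j\zeta}\right)}_{f_j(x)},$$

where $\zeta$ is large. It is clear that $L_J$ is

$$L_J = \max_x\, (f_j(x))'' = \max_x \frac{e^{x + j\zeta}}{(e^x + e^{j\zeta})^2} = \frac{1}{4}.$$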

Therefore, from Theorem 21 we obtain $v = \sum_{J \in \mathcal{J}} L_J = \frac{m}{4}$.

On the other hand, Theorem 18 produces an ESO with

$$v = \max_x\, (f(x))'' \approx \frac{1}{4},$$

provided that $\zeta$ is large, e.g. $\zeta = 100$. Hence, in this case, the ESO from Theorem 18 will lead to
an algorithm that is approximately $m$ times faster than if the ESOs from Theorems 20 or 21 were
used.
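The factor of $m$ in Example 22 can be checked numerically. The sketch below assumes that each sub-function has second derivative $s(1-s)$ with $s = 1/(1 + e^{-(x - j\zeta)})$, i.e. a logistic-type curvature bump centred at $x = j\zeta$; this is algebraically the same as $e^{x+j\zeta}/(e^x + e^{j\zeta})^2$, and the placement of the shifts at $j\zeta$ is an assumption of the sketch. The function names and the grid resolution are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def second_derivative(x, m, zeta):
    """f''(x) for a sum of m logistic-type terms with shifts j * zeta."""
    total = 0.0
    for j in range(1, m + 1):
        s = sigmoid(x - j * zeta)
        total += s * (1.0 - s)  # each term peaks at 1/4 when x = j * zeta
    return total


def max_curvature(m, zeta, points_per_unit=10):
    """Grid search for max_x f''(x) over [0, (m + 1) * zeta]."""
    steps = int((m + 1) * zeta * points_per_unit)
    return max(second_derivative(k / points_per_unit, m, zeta) for k in range(steps + 1))


M, ZETA = 10, 100.0
V_THEOREM_18 = max_curvature(M, ZETA)  # single Lipschitz constant of f'
V_THEOREM_21 = M * 0.25                # sum of the per-function constants L_J = 1/4
```

Because the bumps are separated by $\zeta = 100$, they do not overlap in any meaningful way, so the maximum curvature of the sum stays at about $1/4$ while the sum of the individual maxima grows linearly in $m$.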

Remark. A thorough discussion of the ESO is presented in [26, Section 4]. Moreover, [26,
Section 5.5] presents a list of parameters $v$ associated with a particular $f$ and sampling scheme $\hat S$
that give rise to an ESO. Indeed, each of the samplings described in Section 3.1 in this work gives rise
to a $v$ for which an ESO exists.